Let's scrape some inmate data

Our goal in this exercise is to scrape the roster of inmates in the Hennepin County Jail into a CSV.

Step 1: Can we get everyone?

What happens when we click the search button without entering a first or last name? We're sent to a new URL that lists the entire roster.

This is good news -- some forms require a minimum number of characters before they'll return anything. Now we need to check whether we can go straight to that URL without visiting the landing page and clicking through first. In other words, does that page depend on a cookie being passed?

To test this, I usually open another browser window in incognito mode and paste in the URL. Success! Going to https://www4.co.hennepin.mn.us/webbooking/resultbyname.asp dumps out the entire list of inmates, so that's where we'll start. (You could also open your network tab and see what information is getting exchanged during the request. For more complex dynamically created pages that rely on cookies, we'd probably need the requests Session object.)
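
If the results page had depended on a cookie, a Session would be the likely fix. Here's a rough sketch of that idea, not something this scraper needs: the helper name fetch_with_session and the "visit the landing page first" flow are assumptions for illustration.

```python
import requests


def fetch_with_session(landing_url, target_url, session=None):
    """Hypothetical helper: hit landing_url first, then fetch target_url."""
    # reuse a session if one is passed in, otherwise start a fresh one
    if session is None:
        session = requests.Session()

    # the first request gives the server a chance to set cookies on the session
    session.get(landing_url)

    # the second request goes out through the same session, cookies included
    response = session.get(target_url)
    response.raise_for_status()
    return response.text
```

Nothing runs until the function is called, so it's safe to paste into a cell and try against a site that actually requires the click-through.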

Step 2: Check out the inmate detail page

Let's click on an inmate link. We want to look at two things:

  • Does each inmate have a unique URL with a consistent pattern? (Yes)
  • What information on the page do we want to collect? (Let's grab custody info, housing location and booking date/time)

What's the pattern for an inmate URL? It's the base URL plus chargedetail.asp?v_booknum= followed by the inmate's booking number.
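
A quick sketch of filling in that pattern, using the same base URL and template string the scraper defines in Step 3. The booking number here is made up for illustration.

```python
# same base URL and template the scraper defines in Step 3
url_base = 'https://www4.co.hennepin.mn.us/webbooking/'
inmate_url_pattern = url_base + 'chargedetail.asp?v_booknum={}'

# '2017000001' is a made-up booking number, not a real inmate
example_url = inmate_url_pattern.format('2017000001')
print(example_url)
```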

Step 3: Start scraping

Import the libraries we'll need


In [ ]:
import csv
from datetime import datetime
import time

import requests
from bs4 import BeautifulSoup

Set introductory variables


In [ ]:
# base URL
url_base = 'https://www4.co.hennepin.mn.us/webbooking/'

# results page URL
results_page = url_base + 'resultbyname.asp'

# pattern for inmate detail URLs
inmate_url_pattern = url_base + 'chargedetail.asp?v_booknum={}'

Fetch and parse the page contents


In [ ]:
# fetch the page and stop early if the server returns an error status
r = requests.get(results_page)
r.raise_for_status()

# parse it
soup = BeautifulSoup(r.text, 'html.parser')

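# find the table we want: index 6, i.e. the seventh <table> on the page
# (fragile -- recheck this index if the page layout changes)
table = soup.find_all('table')[6]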

# get the rows of the table, minus the header
inmates = table.find_all('tr')[1:]

Write a couple of functions

We need to pause here and write a couple of functions to help us extract the bits of data from the inmate's detail page:

  • A function that takes the URL for an inmate detail page, fetches and parses the contents, then returns the bits of data we're interested in
  • A more specific function that takes the text of a label cell on a detail page ("Sheriff's Custody:", for instance) and returns the associated value in the next cell. This function will be called inside our other function -- it's not 100% necessary but it keeps us from repeating ourselves a million times

In [ ]:
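def get_inmate_attr(soup, label):
    """Given a label and a soup'd detail page, return the associated value."""
    # climb from the label text up to the cell that contains it
    label_cell = soup.find(string=label).parent.parent

    # step over two siblings (presumably whitespace, then the value cell)
    return label_cell.next_sibling.next_sibling.text.strip()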


def inmate_details(url):
    """Fetch and parse an inmate detail page, return three bits of data."""
    
    # fetch the page and stop early if the server returns an error status
    r = requests.get(url)
    r.raise_for_status()
    
    # parse it into soup
    soup = BeautifulSoup(r.text, 'html.parser')
    
    # call the get_inmate_attr function to nab the cells we're interested in
    custody = get_inmate_attr(soup, "Sheriff's Custody:")
    housing = get_inmate_attr(soup, "Housing Location:")
    booking_date = get_inmate_attr(soup, "Received Date/Time:")

    # return a dict with this info
    # lose the " Address" string on the housing cell, where it exists
    # also, parse the booking date as a date to validate
    return {
        'custody': custody,
        'housing': housing.replace(' Address', ''),
        'booking_date': datetime.strptime(booking_date, '%m/%d/%Y.. %H:%M')
    }
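
To see why get_inmate_attr hops across two siblings, it can help to walk the same path on a toy fragment. The markup below is an assumption about roughly how the detail page is built (a bold label in one cell, a line break, then the value cell), and the housing value is made up.

```python
from bs4 import BeautifulSoup

# toy fragment -- the real page's markup is assumed to look roughly like this
html = """<table><tr>
<td><b>Housing Location:</b></td>
<td> SAMPLE UNIT Address </td>
</tr></table>"""

soup = BeautifulSoup(html, 'html.parser')

label = soup.find(string='Housing Location:')  # the text node itself
label_cell = label.parent.parent               # <b>, then its parent <td>
gap = label_cell.next_sibling                  # the line break between cells
value_cell = gap.next_sibling                  # the <td> holding the value

print(repr(gap))
print(value_cell.text.strip())
```

If the real page had no whitespace between cells, a single next_sibling would be enough, which is why it's worth checking the page source before trusting the chain.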

Loop over the inmate rows, write to file


In [ ]:
# open a file to write to
# newline='' lets the csv module handle line endings itself
with open('inmates.csv', 'w', newline='') as outfile:

    # define your headers -- they should match the keys in the dict
    # we're creating as we scrape
    headers = ['booking_num', 'url', 'last', 'rest', 'dob',
               'custody', 'housing', 'booking_date']

    # create a writer object
    writer = csv.DictWriter(outfile, fieldnames=headers)

    # write the header row
    writer.writeheader()

    # print some summary info
    print('')
    print('Writing data for {:,} inmates ...'.format(len(inmates)))
    print('')

    # loop over the rows of inmates from the search results page
    for row in inmates:
        
        # unpack the list of cells in the row
        booking_num, name, dob, status = row.find_all('td')
        
        # get the detail page link using the template string we defined up top
        detail_link = inmate_url_pattern.format(booking_num.string)
        
        # unpack the name into last/rest and print it
        # split on the first comma only, in case the rest contains another
        last, rest = name.string.split(', ', 1)
        print(rest, last)

        # reformat the dob, which, bonus, also validates it
        dob_parsed = datetime.strptime(dob.string, '%m/%d/%Y')

        # our dict of summary info
        summary_info = {
            'booking_num': booking_num.string,
            'url': detail_link,
            'last': last,
            'rest': rest,
            'dob': dob_parsed.strftime('%Y-%m-%d')
        }

        # call the inmate_details function on the detail URL
        # remember: this returns a dictionary
        details = inmate_details(detail_link)

        # combine the summary and detail dicts
        # by unpacking them into a new dict
        # https://www.python.org/dev/peps/pep-0448/
        combined_dict = {
            **summary_info,
            **details
        }

        # write the combined dict out to file
        writer.writerow(combined_dict)

        # pause for 2 seconds to give the server a break
        time.sleep(2)
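
Before pointing the loop at the live site, it can be worth dry-running one pass of the loop body offline. This sketch reuses the same headers and date formats but writes to an in-memory buffer; every value in it is made up, and the details dict only stands in for what inmate_details returns.

```python
import csv
import io
from datetime import datetime

headers = ['booking_num', 'url', 'last', 'rest', 'dob',
           'custody', 'housing', 'booking_date']

url_base = 'https://www4.co.hennepin.mn.us/webbooking/'

# made-up values standing in for one row of the results table
dob_parsed = datetime.strptime('01/31/1980', '%m/%d/%Y')
summary_info = {
    'booking_num': '2017000001',
    'url': url_base + 'chargedetail.asp?v_booknum=2017000001',
    'last': 'DOE',
    'rest': 'JANE Q',
    'dob': dob_parsed.strftime('%Y-%m-%d'),
}

# made-up values standing in for what inmate_details() returns
details = {
    'custody': 'Yes',
    'housing': 'SAMPLE UNIT',
    'booking_date': datetime(2017, 3, 1, 14, 30),
}

# same dict-unpacking merge the loop uses
combined = {**summary_info, **details}

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerow(combined)
print(buffer.getvalue())

# a key that isn't in the headers makes DictWriter raise ValueError
try:
    writer.writerow({**combined, 'charges': 'n/a'})
except ValueError as err:
    print('rejected:', err)

# a date in the wrong format fails loudly instead of slipping through
try:
    datetime.strptime('1980-01-31', '%m/%d/%Y')
except ValueError as err:
    print('rejected:', err)
```

The two rejected cases are the reason the headers have to match the dict keys and the reason parsing dates doubles as validation.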

Extra credit: Get charge details

It's all well and good to get the basic inmate info, but we're probably also interested in why they're in jail -- what are they charged with?

For this exercise, add some parsing logic to the inmate_details scraping function to extract data about what each inmate has been charged with. Pulling them out as a list of dictionaries makes the most sense to me, but you can format it however you like.

Because each inmate has a variable number of charges, you also need to think about how you want to represent the data in your CSV. Is each line one charge? One inmate? Picture how one row of data should look in your output file and structure your parsing to match.
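
As one possible starting point, here's a sketch of the "each line is one charge" layout, where the inmate-level fields repeat on every row. The helper name rows_per_charge and the charge keys are placeholders, not the site's real labels, and all the data is made up.

```python
import csv
import io


def rows_per_charge(inmate, charges):
    """Yield one flat dict per charge, repeating the inmate-level fields."""
    for charge in charges:
        yield {**inmate, **charge}


# made-up data; the charge keys are hypothetical
inmate = {'booking_num': '2017000001', 'last': 'DOE', 'rest': 'JANE Q'}
charges = [
    {'charge_num': 1, 'description': 'EXAMPLE CHARGE A'},
    {'charge_num': 2, 'description': 'EXAMPLE CHARGE B'},
]

headers = ['booking_num', 'last', 'rest', 'charge_num', 'description']

# write to an in-memory buffer so nothing touches disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(rows_per_charge(inmate, charges))
print(buffer.getvalue())
```

One tradeoff to keep in mind: with this layout an inmate with no charges listed produces no rows at all, so you'd need to decide whether to write a placeholder row for them.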


In [ ]: